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Abstract 

Background: Using microarray and sequencing platforms, a large number of copy number variations (CNVs) have 
been identified in humans. In practice, because our human genome is a diploid, these platforms are limited to or 
more accurate for detecting total copy numbers rather than chromosome-specific copy numbers at each of the 
two homologous chromosomes. Nevertheless, the analysis of linkage disequilibrium (LD) between CNVs and SNPs 
indicates that distinct copy numbers often sit on their own background haplotypes. 

Results: We propose new computational models for inferring chromosome-specific copy numbers by 
distinguishing background haplotypes of each copy number. The formulated problems are shown to be NP-hard 
and approximation/heuristic algorithms are developed. Simulation indicates that our method is accurate and 
outperforms the existing approach. By testing the program in 60 parent-offspring trios, the inferred chromosome- 
specific copy numbers are highly consistent with the law of Mendelian inheritance. The distributions of copy 
numbers at chromosomal level are provided for 270 individuals in three HapMap panels. 

Conclusions: The estimation of chromosome-specific copy numbers using microarray or sequencing platforms was 
often confounded by a number of factors. This study showed that the integration of background haplotypes is 
able to improve the accuracies of copy number estimation at chromosome level, especially for the CNVs having 
strong LD with SNPs in proximity. 



Background 

Genetic variations exist in many forms in the human 
genome. Large structural variations such as deletions 
and duplications are quite common in the human popu- 
lations, which encompass more base pairs than single 
nucleotide polymorphisms (SNPs). Among various types 
of structural variations, copy number variations (CNVs) 
often occupy regulatory regions of genes and greatly 
influence phenotypic traits and disease susceptibility [1]. 
CNV is defined as a DNA segment with length more 
than 1 kb and observed with various numbers of copies 
in the population. A number of CNVs have been known 
to highly associate with several complex diseases such as 
HIV infection, autoimmunity, autism, Parkinson's, Alz- 
heimer's and Crohn's disease [2-6]. 

The advance of high-throughput array platforms and 
sequencing technologies enables fast and cost-effective 
scan of CNVs in genome-wide scale [7]. Using array 
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Comparative Genomic Hybridization (aCGH) and next- 
generation sequencing, a large number of CNVs have 
been identified in human and other primates [8-12]. In 
practice, because our human genome is a diploid, most 
sequencing platforms often report total copy numbers of 
one individual instead of chromosome-specific copy 
numbers presented on each of the two homologous 
chromosomes. For example, suppose there are two 
diplotype configurations at one CNV locus: 1/1 repre- 
sents one copy at each of the two chromosomes, and 0/ 
2 indicates a deletion at one chromosome and a duplica- 
tion at the other. The total copy numbers of these con- 
figurations are both experimentally obtained as two, 
although the underlying mechanisms generating these 
two configurations are different. Nevertheless, determi- 
nation of chromosome-specific copy numbers is impor- 
tant in the analysis of population genetics and disease 
association studies. For instance, the power of detecting 
positive selection and accuracy of measuring Linkage 
Disequilibrium (LD) between SNPs and CNVs can be 
improved through direct use of chromosome-specific 
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copy numbers [13-15]. Moreover, identification of the 
chromosome-specific copy numbers can even shed light 
on the age of each CNV [1], 

Recently, an expectation maximization (EM) algorithm 
was developed to estimate frequencies of chromosome- 
specific copy numbers under the assumption of Hardy- 
Weinberg equilibrium (HWE) [16]. In reality, the 
observed allele frequencies do not completely satisfy 
HWE, because each copy number allele may be sampled 
more or less in different sequencing projects. In a few 
occasions, HWE may be even deviated due to direc- 
tional selection, assortative mating, or migration [17]. 
Using B allele frequencies (BAF) and log R ratios (LLR) 
provided by SNP array platforms, a hidden Markov 
model (HMM) was designed for inferring chromosome- 
specific copy numbers within a parent-offspring family 
[18]. In addition, information of allelic-specific copies at 
each SNP locus (e.g., AAABB) have been also used to 
indirectly infer chromosome-specific copy numbers [19]. 
However, BAF, LLR and allelic-specific copies are not 
always available in each sequencing platform. For exam- 
ple, in next-generation sequencing (e.g., SOLid and Illu- 
mina), SNPs and CNVs called at these platforms (e.g., 
Bioscope and SAMTools) do not provide such informa- 
tion. Moreover, the accuracy of allelic-specific copies is 
often decreased for higher copies and is worse than that 
of total copy numbers due to cross-hybridization 
[20,21]. Although traditional haplotype phasing pro- 
grams (e.g., fastPHASE) may be used for inferring copy 
number by encoding bi-allelic CNVs into SNP geno- 
types, this approach is inadequate to infer multi-allelic 
CNVs [13,19]. 

To date, the analysis of LD structure in human gen- 
ome indicated that many CNVs have strong LD with 
SNPs in proximity, probably owing to uneven distribu- 
tion of recombination hot/cold spots or genetic hitch- 
hiking [22-25]. Moreover, a number of CNVs have been 
shown to be taggable using alleles at flanking SNPs [3]. 
The LD structure between CNVs and SNPs implies dif- 
ferent chromosome-specific copy numbers often sit at 
their own background haplotypes, which can be viewed 
as fingerprints of each copy number. As a consequence, 
chromosome-specific copy numbers of each individual 
are inferable by careful analysis of background haplo- 
types around each CNV. In recent years, several large- 
scale sequencing projects have constructed complete 
haplotype and CNV databases across major human 
populations (e.g., HapMap [26]). Integration of these 
databases may gain insight into the distribution of chro- 
mosome-specific copy numbers in human populations. 

In this study, we develop new computational models 
and combinatorial algorithms for inferring chromo- 
some-specific copy numbers by distinguishing back- 
ground haplotypes of each copy number. Two 



optimization problems are formulated, shown to be NP- 
hard, and solved by approximation or heuristic algo- 
rithms. Simulation indicates our method is very accurate 
and is able to outperform existing approach. By testing 
the program separately for each individual within 60 
parent-offspring trios, the inferred chromosome-specific 
copy numbers are highly consistent with the law of 
Mendelian inheritance. The distribution of chromo- 
some-specific copy numbers across three human popu- 
lations indicate that one copy is the major allele as 
expected, and zero copy (deletion) alleles are much fre- 
quent than high copy (duplication) alleles. 

Methods 

The haplotypes of 270 individuals are downloaded from 
the Phase II of international HapMap project [26]. For 
the input of unphased genotypes, the haplotypes were 
inferred via the PHASE [27] program, which was used 
by the HapMap project. For high-throughput sequen- 
cing data, a number of haplotype assembly tools can 
also be used to infer the haplotypes [28]. The total copy 
numbers of 1,319 CNVs typing on the same individuals 
are retrieved from [9]. We extract SNPs within each 
CNV as well as SNPs at flanking regions in our study. 
We compared the SNP distance (i.e. number of SNPs) 
with the physical distance (e.g., 5 kb) for capturing the 
extent of LD and found that the LD is more sensitive to 
physical distance. The simulation results indicated that 
the accuracy of our algorithm is highest when including 
SNPs within one-fold extension of the physical size of 
each CNV (Additional file 1, Figure SI). Therefore, the 
released program will automatically checks the coordi- 
nates of CNVs and SNPs and captures SNPs within the 
one-fold extension regions into consideration. 

Given a set of haplotype pairs and the total copy num- 
ber for each individual (Figure 1), the chromosome-spe- 
cific copy number of each haplotype is determined by 
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first solving a variant of Max-£-Cut problem, which 
aims to divide background haplotypes into k clusters. 
Then, a variant of Constraint Satisfaction Problem 
(CSP) is solved to assign chromosome-specific copy 
number to each cluster. Finally, these two procedures 
are repeated for all possible k in order to determine the 
best solution. 

Haplotypes Clustering via Solving Constrained Max-fc-Cut 
Problem 

Through analysis of LD between SNPs and CNVs, the 
copy numbers on a CNV are shown to have strong LD 
with alleles at flanking SNPs [13,23,24]. The LD struc- 
ture implies different chromosome-specific copy num- 
bers tend to sit at their own background haplotypes. We 
first group haplotypes spanning across each CNV into k 
clusters (for all possible k) based on their pairwise ham- 
ming distance and total copy numbers. Note that odd 
total copy number implies the underlying two chromo- 
some-specific numbers should be different (e.g., 3 = 0 + 
3 or 1 + 2). Haplotypes clustered into the same set may 
represent haplotype background for the same chromo- 
some-specific copy number. The input total copy num- 
bers and haplotypes are formulated into a weighted 
graph described as following (see Figure 2): 

(1) Each haplotype is transformed into a vertex. 

(2) The weight of edge between two vertices is the 
hamming distance between two haplotypes. Note 
that the same haplotype from different individuals 
are formulated as multiple vertices with zero 
distance. 

(3) For the haplotype pair with odd total copy num- 
ber (e.g., H 1 and H 2 ), the edge between them is 
called hard edge. 

(4) For the other haplotype pairs, the edges between 
them are called soft edges. 

Given the above weighted graph with hard and soft 
edges, these haplotypes are grouped into k clusters by 
solving a variant of Max-^-Cut problem (called 




Figure 2 Formulation of CNVs/haplotypes into a weighted 
graph. Formulation of CNVs/haplotypes into a weighted graph with 
hard and soft edges. The dotted line denotes the hard edge, 
because the total copy of H> and H i is an odd number- 



constrained Max-/c-Cut). A formal definition of the con- 
strained Max-^-Cut problem is given below. 

Problem: Constrained Max-fc-Cut 

Given an undirected weighted graph G = {V, E) in which 
some edges in E are hard and the others are soft, the 
constrained Max-/c-Cut problem aims to find a partition 
of vertices in V into k sets (Xi, X 2 , X/^) such that the 
total weight of soft edges across different sets (called 
cut) is maximized, requiring all hard edges must be on 
the cut. 

The original Max-£-Cut is known to be NP-hard 
[29,30], which is a special case of this problem when all 
edges are soft. Therefore, the problem of constrained 
Max-/c-Cut is also NP-hard. In order to efficiently solve 
the constrained Max-/c-Cut problem, we develop a 
greedy approximation algorithm which explores larger 
solution space by randomizing non-deterministic steps. 
The core procedure of this algorithm is given below 
(Additional file 2, Figure S2). 

Algorithm for Constrained Max-fc-Cut 

(1) Randomly pick k different vertices as initial ele- 
ments for each k set (Xi, X 2 , X^). 

(2) Without violating the constraint of hard edge, 
randomly pick a remaining vertex and assign it into 
the set which maximizes the total weight of soft 
edges across different sets. This step is repeated 
until all vertices are assigned. 

Note that the above procedure involves non-determi- 
nistic parts in both steps (i.e., initial k vertices and the 
order of picking the next vertex). Therefore, this proce- 
dure is repeated ten times to explore larger solution 
space by trying different initial k vertices in step 1 and 
different order in step 2. The best solution among all 
trials is outputted as the final solution. The number of 
repeated iterations is usually a tradeoff between accu- 
racy and efficiency. Nevertheless, we found that the ran- 
domized approaches on top of the greedy framework 
requires only few iterations (Additional file 3, Figure 
S3). Thus, the implemented can run fast in practice. 
The following theorem implies that the solution found 
by this algorithm is quite close to the optimal solution. 

Theorem 1. The algorithm for constrained Max-k-Cut 
is a (k - 2)1 {k - l)-approximation algorithm for k > 2. 

Proof. Without loss of generality, let the order of pick- 
ing vertices be v 1( v 2 , v„. Let W denote the total 
weight of all edges in G and 

i-i 

Wj = ^weight(y„y m ) 
m=l 
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Let Xj be the j-th set of partitioned vertices and 
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Copy Number Assignment via Solving Constraint 
Satisfaction Problem 

After clustering haplotypes into k sets {X lt X 2 , X k ), we 
then assign k different integers to each set, which corre- 
spond to k distinct chromosome-specific copy numbers. 
For each individual, summation of chromosome-specific 
copy numbers of each haplotype pair should be equal to 
his/her total copy number, which can be written as the 
following constraint: 



Copy(H (1 ) + Copy(H, 2 ) = Total(i), 



(1) 



where Copy(H il ) and Copy(H /2 ) are chromosome-spe- 
cific copy numbers for the i-th individual, and Total(i) 
is his/her total copy number. For the example shown in 
Figure 3(A), Copy(H 5 )+Copy(H 6 ) = 2 for individual 3. 
Note that because all haplotypes have been clustered 
into the same or different sets, eq (1) can be rewritten 
into the following constraint using their set variables Xj-. 



X a +X b =Total(i), 



(2) 



where X a and X b denote the sets of these two haplo- 
types after clustering (e.g., X 1 + X 3 = 2 for individual 3). 
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Copy(Hl) + Copy(H2)= 1 
Copy(H3) + Copy(H4) = 1 
Copy(H5) + Copy(H6) = 2 
Copy(H7) + Copy(H8) = 2 
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Figure 3 Haplotype clustering and copy number assignment 

Examples of (A) haplotype clustering and (B) Assignment of 
chromosome-specific copy number. 
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For each individual, a set of constraints with two vari- 
ables similar to eq (2) can be generated by repeating the 
above formulation (see Figure 3(B)). By assigning dis- 
tinct integer numbers to these set variables X it chromo- 
some-specific copy number of each haplotype can then 
be determined. Theoretically, all constraints should be 
satisfied after the assignment, but practically, not all 
constraints can be satisfied, because some ambiguous 
haplotypes may not be the true background of the copy 
number. In order to satisfy as many constraints as possi- 
ble, chromosome-specific copy numbers are assigned to 
each set X t by solving a variant of the constraint satis- 
faction problem (termed Unique Max-2-CSP). Given a 
set of two-variable constraints over n variables (X\, 
X 2 ,..., X n ), the Unique Max-2-CSP problem asks for k 
unique (distinct) integers assigned to each variable 
which satisfied maximum number of constraints. 

Problem: Unique Max-2-CSP 

Given a set of variables X = {X it X 2 , X n }, a set of 
finite integer domains D = {0, 1, d], where d > n - 1, 
and a set of two-variable constraints C = {Q, C 2 , C m } 
with the following form: 



Q : Xi + Xj = Ti, foralll <Z<m, 



(3) 



where Ti is a non-negative integer. The Unique Max- 
2-CSP asks for an assignment of n distinct integers in D 
to X\, X 2 , X n that maximizes the total number of 
satisfied constraints in C. 

We first prove a problem called binary Max-2-CSP is 
NP-hard, in which the integer domain D is restricted to 
{0, 1}, and values assigned to different variables in X are 
allowed to be identical (e.g., Xi = X 2 = 1). 

Then, the unique Max-2-CSP problem is shown to be 
NP-hard by reduction from binary Max-2-CSP. The 
details of these proofs can be found in Additional file 4, 
Supplementary Material. 

Theorem 2. Unique Max-2-CSP is NP-hard. 

In order to solve unique Max-2-CSP more efficiently, 
we developed a greedy heuristic algorithm which also 
explores larger solution space by randomizing non- 
deterministic steps. Let n be the number of individuals 
and c max be the maximum possible copy number. 

Algorithm for Unique Max-2-CSP 

For 1 < i < n, 0 < c < c max , do step (1) to step (3) (see 
Figure 4). 

(1) Initially set X t = c. 

(2) Randomly pick a constraint {N : X a + X b = T } in 
which only X a (or X b ) is assigned, where N is the 
number of the constraint. If X a (or X b ) = D, and 
there are m types of constraints with X a + X b as 
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Step (4) 
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Figure 4 The algorithm of assigning chromosome-specific copy 
number. An example of the algorithm for assigning copy number 
by solving unique Max-2-CSP. 



following: {Nj : X a + X b = Tj }, where 1 < j < m, and 
Ni is maximum in Np assign X b (or X a ) = T l - D. 
Repeat this step until there is no constraint in which 
only one variable is assigned. 

(3) Compute the number of satisfied constraints with 
respect to X t = c. 

Ideally, once the value of the initial variable X t is 
assigned (e.g., Xi = 0), the values of other associated 
variables can be indirectly determined (e.g., X^ + X 2 = 1 
or X 1 + X 3 = 2). However, there could be some conflict- 
ing constraints existed (e.g., X 2 + X 3 = 3). Therefore, the 
possible values of all variables X t are dependent on the 
order of assignment (e.g., X\ = 0 first, X 2 = 1 second, 
then ...). In reality, there are more variables and the 
dependency/conflict relations are more complicated. 
Consequently, we repeat the above procedure ten times 
to explore different orders of assignments by randomly 
prioritizing distinct constraints to be satisfied in differ- 
ent rounds. The best solution among all iterations is 
recorded into the corresponding row in the solution 
table. Note that some variables may have no assignment 
due to conflicts with previously assigned variables and 
hence are recorded as -1. 

After the above procedure is iterated over possible 
initial values of all variables, a solution table will be 
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created. Each row stands for one assignment corre- 
sponding to the initial value of some variable. Note that 
although each row represents a set of possible assign- 
ment, the assignment may not satisfy all variables due 
to the lack of dependency with other variables (e.g., X 4 
may not be reachable from X x ). Therefore, we do step 
(4) iteratively using a greedy approach. 

(4) Select a row which is not chosen in the solution 
table with maximum number of satisfied constraints. 
Repeat this row selection until no further constraints 
can be satisfied. Note that the variables assigned in 
one iteration cannot violate the assignment in pre- 
vious iteration. 

Finally, the union of assignments selected by this 
greedy algorithm is outputted as the solution. 



unsatisfied individuals using the following randomized 
approach (see Figure 5 for an example): 

(1) Randomly pick an individual with haplotypes vio- 
lating the constraint and enumerate all possible 
assignments for these two haplotypes such that the 
constraint can be satisfied. 

(2) For each possible assignment, evaluate the new 
cut value in the Max-k-Cut problem and choose the 
assignment with maximum cut among all 
possibilities. 

(4) Repeat step (1) and step (2) for the remaining 
unsatisfied individuals until all of them are satisfied. 

Because the order of individuals processed is non- 
deterministic, we also repeat above procedure ten times 
and output the best solution among them. 



Iteration and Adjustment 

The previous two procedures (haplotype clustering and 
copy number assignment) are repeated for all possible 
numbers of clusters k, because the best setting of k can 
not be known in advance. We try all possible k from 
two to maximum possible number. For example, if a 
CNV have total copy number 2, 3, 4 in populations, the 
maximum possible k is 5 since all possible chromo- 
some-specific copy numbers range from 0 to 4. We 
choose the best k with maximum number of satisfied 
constraints in unique Max-2-CSP. In practice, the con- 
straints of some individuals may be still unsatisfied after 
these iterations, because the ambiguous haplotypes, 
which are not the true background of underlying copy 
number, may confuse the haplotype clustering. Conse- 
quently, we adjust the clustering results for these 
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Figure 5 Adjustment for the individuals with unsatisfied 
constraints. An example of adjusting clustering result for the 
individuals with unsatisfied constraint. The first constraint is 
unsatisfied for XI = 0, X2 = 1 and X3 = 2. Adjust the clustering 
result by putting HI in X1 and putting H2 in X2 for satisfying the 
first constraint and maximizing the cut size. 



Simulation 

The simulation of LD and HWE generated two series of 
copy numbers and SNP genotypes from 16 individuals. 
The haplotype phases are inferred via PHASE [27]. The 
first series of data sets simulate diplotype configurations 
completely match HWE (P = 1.0, Chi-square). The 
flanking SNPs are simulated starting from perfect LD 
(average r 2 = 1.0). Subsequently, the remaining data sets 
of lower LD are constructed by flipping SNP alleles at 
random. The second set of experiments simulated an 
imperfect HWE data sets by adding/deleting some copy 
number alleles from the HWE data sets, which aims to 
slightly deviate from the expected HWE frequency (P = 
0.98). The remaining data sets of LD decay are gener- 
ated in a similar way. 

The simulation using copy number on X Chromo- 
somes is also adopted. Because there is only one X chro- 
mosome in each male, the total copy number obtained 
on X chromosome directly represents the chromosome- 
specific copy number [9]. We use CNVs and haplotypes 
in X chromosomes of males from the HapMap project 
during simulation. The total copy numbers at one CNV 
is simulated by randomly pairing two copy numbers on 
two different X chromosomes. 

In order to compare the accuracy of our algorithm 
and CNVphaser [1,16], which outputs posterior prob- 
abilities of each copy number, we parsed the output files 
of CNVPhaser and picked up the diplotype configura- 
tion with highest probability for each individual. The 
accuracy of inferred copy number configurations is 
defined as following: 



Accuracy ■■ 



Ctotal 
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where C correct is the number of correctly inferred copy 
number configurations and C tota i is the total number of 
configurations. 

Results 

The proposed algorithms have been implemented as a 
program called CSCNPhaser, which is available at 
http:/ /www.cs.ccu.edu.tw/ -ythuang/Tool/ CSCNPhaser/. 
We retrieved haplotypes of 270 individuals from Phase 
II of the International HapMap Project [26]. These indi- 
viduals include 30 trios from the Utah, USA region 
(CEU); 30 trios from the Yoruba in Ibadan, Nigeria 
(YRI); 45 unrelated Japanese individuals from Tokyo, 
Japan (JPT); and 45 unrelated Han Chinese individuals 
from Beijing, China (CHB). In addition, total copy num- 
bers at 1,319 CNVs typing on the same 270 individuals 
are downloaded from [9]. We consider haplotypes 
within the CNV as well as haplotypes at flanking 
regions, whereas the best length of extended haplotypes 
is determined by simulation (see Method). 

Simulation on LD Decay and HWE 

We compared the CNVPhaser [1,16] and our program 
CSCNPhaser over two series of data sets with respect to 
different LD and HWE (see Method). Although the 
copy numbers in both experiments almost match the 
ideal HWE, the slight deviation from HWE is shown by 
the P values using the Chi-square test. The first set of 
experiments simulated complete HWE, in which the 
copy alleles in all data sets completely follow the 
expected frequency (P = 1.0). The flanking SNP alleles 
are randomly flipped to decay the LD. Figure 6(A) plots 
the accuracies of CSCNPhaser and CNVPhaser at differ- 
ent degrees of LD under complete HWE. Because 
CSCNPhaser is designed based on the LD of back- 
ground haplotypes, the accuracy is decreasing as the 
background haplotypes are less LD-informative. Unex- 
pectedly, we found the accuracy of CNVPhaser also 
deteriorates as LD decays. This is because CNVPhaser 
estimated the combined frequencies of the entire haplo- 
type and copy number to match HWE, which implicitly 
captured LD in a light way. CSCNPhaser outperforms 
CNVPhaser as the background haplotypes are more LD- 
informative, and both accuracies are worse as haplotypes 
are less informative. 

The second set of experiments simulated an imperfect 
HWE data sets by adding/deleting a few copy number 
alleles to slightly deviate from the expected HWE fre- 
quency (P = 0.98). Note that the entire allele frequency 
spectrum is still close to that of HWE. Figure 6(B) plots 
the accuracies of CSCNPhaser and CNVPhaser at differ- 
ent degrees of LD. In high LD, both programs can still 
achieve high accuracies. 
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Figure 6 Accuracies of inferred chromosome-specific copy 
number. (A) The average accuracy of CNVphaser and CSCNPhaser 
under complete HWE and various LD. (B) The average accuracies of 
CNVphaser and CSCNPhaser under near HWE and various LD. 



Although the major trends are similar to previous 
experiments, CNVPhaser is slightly worse than previous 
experiment compared with our method, implying it is 
more sensitive to HWE deviation. 

Consistency with Mendelian Inheritance 

The developed program is further applied on 1,292 
CNVs on autosomal chromosomes typing over 270 Hap- 
Map individuals from [9]. We discarded CNVs with less 
than 10 SNPs, because they are less informative about 
LD. There are 969 CNVs used in following experiments. 
The copy numbers observed among normal individuals 
should be overwhelmingly inherited from their parents. 
By running our program separately for each individual 
within 60 parent-offspring trios (CEU and YRI panels), 
correctness of our method can be justified by checking 
the Mendelian consistent rate of inferred chromosome- 
specific copy numbers within trios [9]. More than 97 
percent of CNVs have Mendelian consistent rate larger 
than 0.9 (see Figure 7). These results indicate that 
majority of copy numbers inferred by our method is 
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Consistent rates of Mendelian inheritance of inferred chromosome- 
specific copy numbers in 60 parent-offspring trios. 



under expectation from the law of Mendelian inheri- 
tance. The remaining few CNVs might imply novel dele- 
tions/duplications or translocation-mediated CNVs 
[31,32]. 



(A) Distribution of Chromosome-Specific Copy Numbers 
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(B) Distribution of Diplotype Configurations 
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Figure 8 Distributions of chromosome-specific copy numbers 
and diploid copy number configurations. Distributions of 
chromosome-specific copy numbers and diploid copy number 
configurations. (A) Distribution of chromosome-specific copy 
numbers in three HapMap panels. (B) Distribution of diploid 
configurations in three HapMap panels. 



Distributions of Copy Number Configurations in Human 
Populations 

The chromosome-specific copy numbers of 270 indivi- 
duals in CEU, YRI, and CHB+JBP HapMap panels are 
inferred by our program in order to investigate the dis- 
tributions of haplopid and diploid configurations in 
human populations. Figure 8(A) plots the haploid distri- 
bution of chromosome-specific copy numbers inferred 
by our program. Our results indicate that one copy on 
each chromosome is the major allele in the population 
as expected. Zero copy (deletion) is the second frequent 
allele compared with two copies (duplication). Frequen- 
cies of higher chromosome-specific copy numbers are 
relatively lower. This is not unexpected because multiple 
duplication events at the same CNV locus are relatively 
less common. 

In the distribution of diploid configurations (Figure 8 
(B)), 1/1 configurations are the most frequent form as 
expected. We observed 0/2 configurations (deletion 
+duplication) is the second frequent one. This phenom- 
enon may be explained by the fact that 1/1 and 0/2 con- 
figurations contribute equally to gene copy balance in 
humans. In order to assess the miscalled rates of 1/1 
into 0/2 configurations, we conducted a series of simula- 
tion experiments of only 1/1 diplotype configurations 
(i.e., no CNV). Because 1/1 configuration is miscalled to 
0/2 by CSCNPhaser only when the background 



haplotypes are not LD-informative, we investigated the 
miscalled rates of data sets starting from LD-informative 
haplotypes down to low-LD ones. Specifically, the haplo- 
types are one-by-one replaced with non-informative 
haplotypes. Figure 9 plots the miscalled rates with 
respect to the percentage of replaced haplotypes. When 
the majority of haplotypes are LD-informative (>60%), 
the miscalled rate is low (~0.06). As more haplotypes 
are replaced with non-informative ones, the miscalled 
rate goes up as expected. 




Percentage of non-LD Haplotypes 

Figure 9 The miscalled rates of 1/1 configurations to 0/2 
configurations. The miscalled rates of 1/1 configurations to 0/2 
configurations with respect to the percentage of non-informative 
haplotypes. 
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On the other hand, 0/1 and 0/0 configurations (hemi- 
zygous and homozygous deletions) are more frequent 
than the remaining duplication forms, probably due to 
the low frequencies of high-copy alleles. Although these 
distributions are consistent across three HapMap panels, 
these results could be still biased due to low-LD back- 
ground of recurrent or translocation-mediated CNVs. 
Therefore, these distributions are for reference only, 
which require experimental validations before further 
interpretation. 

Capability and Efficiency 

Although the maximum copy number in the population 
is still not clear, it is worth of interest to know the cap- 
ability and efficiency of both programs for processing 
data sets with large copy numbers. Figure 10 plots the 
average running time of CNVPhase and CSCNPhaser 
over a range of maximum copy numbers. Both programs 
are able to accept input of up to 60 copies. The differ- 
ences are the running time and memory usage. 
CNVPhaser requires longer time (>1 min) and more 
memory (>1GB) for >50 copy numbers, whereas 
CSCNPhaser is very fast (within seconds) and does not 
consume much memory. 

Discussion 

Strength and Weakness of LD-based Inference 

CNVPhaser was developed by estimating allele frequen- 
cies using HWE, while our CSCNPhaser investigated the 
haplotype background of each copy number. Although 
not explicitly stated, we observed CNVPhaser implicitly 
capture background haplotypes in a light way, because 
the frequencies are estimated over the entire copy num- 
ber/haplotype combinations. Therefore, its accuracy also 
decreases as LD decays. For CNVs having high LD with 
flanking SNPs, our program performs better than 
CNVPhaser. In low LD regions with only 1/1 configura- 
tions, we observed that CSCNPhaser may miscall them 




5 10 20 30 40 50 60 

Maximum Copy Number 

Figure 10 Comparison of capacity and efficiency The running 
time of CNVPhaser and CSCNPhaser for processing data sets with 
different maximum copy numbers. 



as 0/2 configurations. On the other hand, in data sets 
with mixed configurations (i.e., 1/1 and 0/2), we 
observed that the miscalled rate is lower, because these 
data sets contain haplotypes LD-informative of 0/2 con- 
figuration, which are used for better distinguishing 1/1 
from 0/2 configurations in our algorithm. Most com- 
mon, diallelic CNVs have been found to have strong LD 
with flanking SNPs, and most low-frequency CNVs even 
segregated on specific haplotype background [9]. There- 
fore, we anticipate it is useful to look at the haplotype 
background for inferring copy number of most CNVs. 

It should be noted that the LD-based approach is not 
suitable for recurrent CNVs or translocation-mediated 
CNVs, in which their background haplotypes are less 
informative of the copy number. In fact, our simulation 
on X-chromosome CNVs found two possible recurrent 
CNVs with lower accuracies compared with other ordin- 
ary CNVs (Tables 1 and 2). Nevertheless, the CNVPha- 
ser and our program can work in a hybrid way to 
overcome the limitation. The LD of SNPs across the 
CNV can be computed first. If the LD is low (i.e., recur- 
rent CNVs), it might be a clue for not looking into the 
haplotypes for copy number inference. That is, we can 
run the CNVPhaser but exclude SNP genotypes for pure 
HWE frequency estimation. As for other CNVs with 
LD-information haplotypes, our program can be used to 
achieve higher accuracies. With the release of next gen- 
eration sequencing platforms, SNPs and CNVs are often 
collectively called in each sequencing project. And the 
accuracies of inferring these CNVs can be improved by 
further looking into the LD background of each copy 
number. 

Strategies for Solving Constrained Max-fc-Cut 

The original Max-^r-Cut problem can be solved by a 
randomized algorithm which randomly partition all ver- 
tices into k sets. In addition, it can be also solved by a 
deterministic greedy algorithm which iteratively assigns 
one vertex into the set that maximizes the cut size. In 
fact, both algorithms return a good approximate solu- 
k - 1 

tion within a factor of of the optimal solution, and 

k 

the semidefinite-programming (SDP) relaxation can 
achieve a better approximation bound [30]. Therefore, it 
is natural to consider the three strategies for solving the 



Table 1 Accuracies on X chromosome Simulation 



CNV IDs 


Exp. 1 


Exp. 2 


Exp. 3 


Exp. 4 


Exp. 5 


16 CNVs 


0.788272 


0.828395 


0.869445 


0.91 7284 


0.953704 


CNV2621 


0.585185 


0.618519 


0.61 1111 


0.666667 


0.685185 


CNV2682 


0.314815 


0.222222 


0.359259 


0.292593 


0.285185 



The average accuracies of 16 ordinary CNVs compared with those of two 
recurrent CNVs. 
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Table 2 Copy numbers of the 18 CNVs on X 
chromosomes 



CNVid 


Copy Number 


CNVid 


Copy Number 


2593 


0,1 


2654 


0,1 


2603 


0,1 


2659 


0,1 


2604 


0,1 


2675 


0,1 


2619 


0,1 


2678 


0,1 


2621 


0,1,2 


2682 


1,2,3 


2627 


0,1 


2694 


0,1 


2636 


1,2 


2704 


0,1 


2639 


1,2 


2706 


0,1 


2648 


0,1 


2707 


1,2 



The copy numbers at 18 CNVs on X chromosomes in males. Two possible 
recurrent CNVs are highlighted. 



constrained version. Although not presented in this 
paper, the random partition method was ever considered 
but later withdrawn due to the bad accuracies in all 
experiments. It is because the random partition method 
simply guess the solution, which can't work in practice 
although the approximation ratio is theoretically good. 
The SDP strategy is theoretically sound but the running 
time is slow in our previous study [33], and the SDP 
implementation is complex so that the program is often 
not easily portable to all platforms. On the other hand, 
the greedy algorithm (with randomization enhancement) 
can achieve high accuracies and run very fast in all 
experiments. As a consequence, the greedy solution is 
taken in order to perform genome-wide experiments 
with high accuracies and within reasonable period of 
time. 

Integration of Greedy and Randomized Approaches 

Theoretically, the two optimization problems (Max-/c- 
Cut and Max-2-CSP) can be both solved by a determi- 
nistic greedy approach or a pure randomized approach 
(e.g., random partition for the /c-cut problem). The 
greedy approach is simple and fast. However, the solu- 
tion found is often only theoretically sound but not 
comparable with other heuristic methods in practice. 
This is due to the fact that the ordinary greedy approach 
tends to find local optimum solution instead of global 
optimum solution. On the other hand, the pure rando- 
mized (blind-search) approaches do not have the ten- 
dency of finding local optimum solution but requires 
numerous iterations for finding a good solution. There- 
fore, when solving both algorithms, we used the greedy 
algorithm as a framework and randomized the non- 
deterministic steps for searching better solutions. The 
results showed that the iterations required of this hybrid 
approach are far less than those of pure randomized 
approaches, while obtaining better solutions than an 
ordinary greedy algorithm. 



Conclusion 

In this study, we developed new computational models 
and combinatorial algorithms for inferring chromo- 
some-specific copy numbers by distinguishing their hap- 
lotype background. Simulation showed that our method 
is accurate and outperformed existing method as the 
background haplotypes are LD-informative of the copy 
numbers. The inferred copy numbers are consistent 
with Mendelian inheritance for 97% of CNVs within 
parent-offspring trios. The inference of copy numbers in 
microarray and sequencing platforms are often con- 
founded by a number of different factors. This study 
showed that integration of haplotypes into copy number 
estimation is able to improve the accuracies, especially 
for those CNVs having strong LD with SNPs. 

Additional material 



Additional file 1: Supplementary Figure SI. Simulation of extended 
regions to the left or right of a CNV. The results indicate accuracy is 
highest when the extension is equal to the size of CNV. 

Additional file 2: Supplementary Figure S2. An example of the 
algorithm for solving the constrained Max-k-Cut problem. 

Additional file 3: Supplementary Figure S3 The average accuracies of 
CSCNPhaser with respect to different iterations used. 

Additional file 4: Supplementary Material Other Methods and proofs 
of theorems in this paper. 
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